A Variable Photo-Model Method for Object Pose and Size Estimation with Stereo Vision in a Complex Home Scene

Model-based stereo vision methods can estimate the 6D poses of rigid objects. They can help robots to achieve a target grip in complex home environments. This study presents a novel approach, called the variable photo-model method, to estimate the pose and size of an unknown object using a single photo of the same category. By employing a pre-trained You Only Look Once (YOLO) v4 weight for object detection and 2D model generation in the photo, the method converts the segmented 2D photo-model into 3D flat photo-models assuming different sizes and poses. Through perspective projection and model matching, the method finds the best match between the model and the actual object in the captured stereo images. The matching fitness function is optimized using a genetic algorithm (GA). Unlike data-driven approaches, this approach does not require multiple photos or pre-training time for single object pose recognition, making it more versatile. Indoor experiments demonstrate the effectiveness of the variable photo-model method in estimating the pose and size of the target objects within the same class. The findings of this study have practical implications for object detection prior to robotic grasping, particularly due to its ease of application and the limited data required.


Introduction
For home service robots, vision systems are widely used to perceive target objects in the environment [1]. Estimating an object's 6DOF pose and size is important for autonomous robots to track or grasp it. Stereo vision is a widely adopted and low-cost method for estimating a 3D pose. Compared with RGB-D sensing, it can handle a greater variety of target material properties and lighting conditions [2,3]. However, detecting the 3D pose of arbitrary objects remains a challenge, particularly when the shape or size of the target object cannot be predetermined.
In terms of the pose detection, stereo-vision methods can be roughly divided into stereo-matching and model-matching methods. Stereo matching, also known as disparity estimation, aims to find the corresponding points of a physical point in a pair of rectified stereo images. Furthermore, through epipolar geometry, stereo vision computes the 3D coordinates of this physical point (2D-3D method). According to the number of matching points, they are divided into feature-based [3] and point-cloud-based methods.
Feature-based methods match only some feature points of the target and estimate the pose from these points [4][5][6]. Point-cloud-based methods generate a scene point cloud, which can be seen as a global extension of feature-based methods. They use 2D image object detection to segment the corresponding point cloud for pose detection. However, it is generally necessary to organize and structure the 3D discrete points into a higher-level representation, such as voxels [7,8]. Removing mismatched noise points and identifying and segmenting target objects in point clouds are complex problems [9]. Whichever method is used, mismatches are inevitable.
Model-based matching methods, also known as template-based methods, can avoid mismatches and are also suitable for occlusion situations [10][11][12]. All the points of a solid 3D model as a group are projected into stereo-vision image planes and are matched with the actual target (3D-2D method). Model generation is a difficult task, relying on the model's style and size. Some learning-based methods detect objects in a 2D image and then segment the RGB-D point cloud to create a 3D model [13]. However, the size of the models is difficult to change. Several researchers have used deformable models combined with stereo vision to measure the size of tuna with excellent results [14]. However, the complexity of the model building limits the generality of this method in detection.
We have previously proposed a photo-model-based pose estimation method. This method involves segmenting the target object from a photo and constructing a 2D photo-model of it. A 3D photo-model is generated from the 2D photo-model. The pose-changed 3D photo-model is projected onto the stereo-vision image planes and matched with the actual target. This process can be summarized as 2D-3D-2D [15]. Experiments have proven the reliability and effectiveness of the photo-model approach for pose estimation using one photo taken at a known distance [16].
However, this method required photographing an object of unknown size at a specific distance in order to determine the pixel/metric (PM) ratio. From this ratio, the object's actual size was calculated and a 2D photo-model generated. We also experimentally demonstrated that the pose of an object can be estimated and tracked in real time [16].
The PM ratio is an important parameter for building, from the 2D photo, a 3D photo-model with the same size as the object [17,18]. Other studies usually rely on camera calibration with reference objects of known size to determine this ratio [19,20]. However, if the shooting distance of the photo is unknown or there is no reference object, the PM ratio cannot be obtained in this way. In the work described in this paper, no special photos are required. The proposed method assumes the PM ratio and converts the 2D photo-model into variable 3D plane photo-models. Through stereo-vision model matching and a genetic algorithm (GA), it determines the object's pose and size at the same time.
On the other hand, in our previous studies, 2D photo-model making relied on the threshold segmentation of simple background photos [15,16]. However, the threshold value needed to be reset when the background changed. Due to the development of modern deep learning techniques, object detection in 2D photos has achieved good results in different contexts [21]. This study uses the training results of YOLOv4 [22] on the MS COCO dataset (https://github.com/AlexeyAB/darknet#how-to-evaluate-ap-of-yolov4-on-thems-coco-evaluation-server, accessed on 20 September 2022) to detect the object and simplify the 2D photo-model generation process. Size-variable 3D photo-models are generated from a 2D photo by assuming the PM ratio of the pixel length to the actual length of the object. Since the prepared photo does not involve multiple classes, and the production process does not require real-time capabilities, the widely used algorithm YOLOv4 is selected for this purpose [22]. During the experiment, YOLOv8 had not been released yet [23]; thus, it is not utilized in this paper. Additionally, the Transformer algorithm has also demonstrated excellent performance in object detection [24]. However, the main focus of this paper is not on 2D object detection but rather on determining whether the spatial dimensions of the generated photo-models can be used for the pose and size detection of similar objects. In the subsequent experiments, it was found that the YOLOv4 model effectively detected and accurately outlined the objects in the prepared photos.
In terms of 3D pose detection, the proposed variable photo-model method is a model-based matching method, not a data-driven method; hence, it requires no additional training [25,26] and can run on CPUs with limited hardware. Using a similarity factor for the matching degree of the projected model in the left and right images, we constructed a new photo-model matching function. We hope to improve the existing photo-model-based algorithms and lay a good foundation for future research on visual servo systems.
With an industrial product and a piece of fruit, pose-size detection experiments were conducted to verify the effectiveness of the proposed method for daily life. According to the results, with only one category of photo, the target's pose and size could be estimated.
More precisely, the contributions of this paper are as follows: (1) This paper allows the utilization of photos taken at unknown distances for model generation. It extends the traditional photo-model-based approach; (2) With just one photo, this method enables the generation of 3D plane models with varying aspect ratios and sizes, which can be used for object pose estimation; (3) The variable photo-model method combines deep learning techniques to simplify the traditional algorithm model creation process. It leverages pre-trained weights from existing datasets, eliminating the need for additional training. One of its advantages is that it can be executed on a CPU with limited hardware resources.
The rest of the present paper is organized into the following sections: Section 2 provides an overview of the relevant literature and previous studies. Section 3 presents variable photo-model generation and the photo-model pose and size estimation method. In Section 4, we discuss the adaptability of the proposed method for recognizing an object's pose and size according to the experimental results. The conclusions and future work are described in Section 5.

Related Work
Regarding partial occlusion, several previous studies [15,27] have explored different environmental factors affecting its handling. These studies provide experimental evidence to support the effectiveness of the photo-model approach [15,27].
Furthermore, the practicality of the photo-model-based method under different lighting conditions was tested experimentally [28]. The experiments focused on two common light sources: fluorescent and light-emitting diode (LED) lighting. The method's ability to tolerate changes in illumination for object recognition was analyzed, and the results demonstrated its robustness in handling different light sources and levels of illumination. Additionally, a visual servo system was developed for capturing marine creatures [29]. The adaptability of the photo-model method to these factors will not be discussed further in this article.
On the other hand, research on 3D indoor object detection using stereo images is still limited. There is a model-based approach that utilizes object model projections on synthetic and real datasets to train networks to detect object poses [30]. However, most existing datasets for pose estimation rely on RGB-D data rather than binocular vision [31]. Furthermore, while there have been studies exploring the use of infrared (IR) stereo imaging for vegetable classification [32], the available stereo benchmark datasets primarily consist of RGB imagery and lack object size information. This lack of comprehensive benchmark datasets has led many studies in stereo vision pose estimation to rely on their own target-specific datasets instead of publicly available benchmarks [33,34].
In the next section, related work in the field of photo-model-based methods and object pose detection is reviewed. The limitations of existing databases are also discussed. Regarding 3D pose detection, the variable photo-model method belongs to the model-based matching approach and does not rely on extensive data-driven techniques [25,26,34]. This eliminates the need for additional training and allows the method to run efficiently on CPUs with limited hardware resources. The approach combines both deep learning techniques and traditional methods.

Variable Photo-Model Pose and Size Detection Method
This section introduces the variable photo-model pose and size detection methodology. Figure 1a shows the experimental environment, and Figure 1 indicates the coordinate systems used.

Variable Photo-Model Generation
This subsection describes the model generation before explaining the stereo-matching method. The model generation has two parts. The first generates a fixed 2D model in pixel units. The second generates a 3D plane model whose size (length and width) in millimeters is variable. Estimating the relative pose requires the generated 3D plane model.

Figure 2 shows the model generation process. We did not take a photo of the target pear, but downloaded one photo (Figure 2a) from Bing Images. Figure 3a shows the actual target, and Figure 3b shows the downloaded photos. The pre-trained YOLOv4 weight for the MS COCO dataset is used to detect the object in the photo. The bounding box is defined as the model frame (Figure 2b). Figure 4a shows the coordinate system of the model, $\Sigma_P$. The size of the 2D model frame is $L_P \times B_P$ pixels, i.e., the 2D photo-model pixel size. The outer portion is larger than the model frame. Sampling points are taken in the model at a regular pixel interval (Figure 2c). The coordinate of the i-th sampling point in the 2D pixel coordinate system $\Sigma_P$ is given by Equation (1).

In order to search for the object, the photo-model needs to be converted from a 2D pixel model to a 3D spatial plane model. The coordinate of the i-th point of the j-th model, ${}^{M}\mathbf{r}_i^j$, is given by Equation (2). Equation (3) gives the conversion of the i-th sampling point between $\Sigma_P$ in Figure 4 and $\Sigma_{Mj}$ in Figure 1b.
In Equation (3), $\alpha_j$ is the PM ratio of the j-th model in the x direction, and $\beta_j$ is the PM ratio of the j-th model in the y direction [20]. The unit of the PM ratio is pixel/mm; it is the ratio of the 2D pixel model to the 3D spatial plane model. $\alpha_M$ and $\beta_M$ are defined as the real ratios of the 2D pixel model to the target object. The relationship between $\alpha_j$ and $\beta_j$ is given by Equation (4), where $k_j$ is the ratio factor. Figure 4 illustrates the calculation for i = 109 and j = 1. For the j-th 3D spatial plane model, the length and width are calculated as in Equation (5).
Equation (5) converts the 2D pixel model into a 3D spatial plane model. The thickness of the model is ${}^{M}z_i = 0$; therefore, the resulting 3D photo-model is a plane in 3D space. In this study, ${}^{M}\mathbf{r}_i^j$ can be described as a function of $\alpha_j$ and $k_j$, i.e., ${}^{M}\mathbf{r}_i^j(\alpha_j, k_j)$. The 3D plane model is composed of dots whose relative positions are predefined as in Figure 4.
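The conversion from the 2D pixel model to a variable 3D plane model can be illustrated with a short sketch. This is not the authors' implementation. It assumes that pixel coordinates are divided by the PM ratio (consistent with the stated unit of pixel/mm), that the ratio factor relates the two ratios as $\beta_j = k_j \alpha_j$, and that the model length lies along x and the width along y. The names `PlaneModel` and `make_plane_model` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlaneModel:
    length_mm: float   # model size along x (assumed to be the "length")
    width_mm: float    # model size along y (assumed to be the "width")
    points_mm: list    # sampling points (x, y, z) in mm, with z = 0


def make_plane_model(points_px, frame_px, alpha, k):
    """Scale 2D pixel sampling points into a flat 3D model.

    points_px: list of (x, y) pixel coordinates in the photo-model frame
    frame_px:  (L_P, B_P), size of the model frame in pixels
    alpha:     assumed PM ratio in x, in pixel/mm
    k:         assumed ratio factor, with beta = k * alpha (an assumption)
    """
    if alpha <= 0 or k <= 0:
        raise ValueError("alpha and k must be positive")
    beta = k * alpha  # PM ratio in y under the stated assumption
    points_mm = [(x / alpha, y / beta, 0.0) for x, y in points_px]
    l_p, b_p = frame_px
    return PlaneModel(l_p / alpha, b_p / beta, points_mm)


# One photo, two candidate models with different assumed sizes.
samples = [(100.0, 50.0), (-100.0, -50.0)]
small = make_plane_model(samples, (200.0, 100.0), alpha=4.0, k=1.0)
large = make_plane_model(samples, (200.0, 100.0), alpha=2.0, k=1.0)
print(small.length_mm, large.length_mm)
```

A smaller assumed PM ratio yields a physically larger model from the same photo, which is why the ratio can be treated as a search variable alongside the pose.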

[Figure labels: model inner portion; model outer portion.]

Projective Transformation of the Photo-Model
The projective transformation of the fixed photo-model was proposed in our previous papers [16,35]. Previously, ${}^{M}\mathbf{r}_i^j$ was generated from a photo of the original object, so it was a size-fixed model whose size was the same as that of the real target. In this paper, ${}^{M}\mathbf{r}_i^j$ is a variable photo-model, and thus a function of the PM ratio.
As shown in Figure 1a, the pose of $\Sigma_{M_{C01}}$ based on $\Sigma_H$, including three position variables and three orientation variables in quaternion [16], is given by Equation (6). As shown in Figure 1b, based on $\Sigma_H$, the pose of the j-th 3D model, ${}^{H}\boldsymbol{\phi}_M^j$, is defined by Equation (7), which has been explained in previous studies [16,35]; see also [36].
Concerning stereo vision, the position ${}^{CL}\mathbf{r}_i^j$ of the i-th point of the j-th 3D model, expressed in $\Sigma_{CL}$, can be calculated through Equation (8). Using the projective transformation matrix $P_{CL}$, ${}^{CL}\mathbf{r}_i^j$ is projected onto the left image as ${}^{IL}\mathbf{r}_i^j$, which can be written in short form (Equation (10)). ${}^{IR}\mathbf{r}_i^j$ can be described in the same manner as ${}^{IL}\mathbf{r}_i^j$. The projective transformation process is summarized in Figure 5a, i.e., 2D-3D-2D [15]. The projection calculation process of the C02 photo-model is the same as that of C01. Equations (1) to (10) thus give a systematic procedure for the 2D-3D-2D process, which begins by generating a 3D photo-model from a single photo and ends by mapping the pose-transformed model onto the left and right camera images.
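The 3D-to-2D step of this process can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: ideal pinhole cameras, a rectified parallel stereo pair, a reference frame placed midway between the two cameras (rather than $\Sigma_H$ with its own transformation matrices), and a unit quaternion whose scalar part is recovered as $\eta = \sqrt{1 - \|\varepsilon\|^2}$. The focal length, principal point, and baseline values are hypothetical.

```python
import math


def quat_to_rot(e1, e2, e3):
    """Rotation matrix from the vector part of a unit quaternion."""
    n2 = e1 * e1 + e2 * e2 + e3 * e3
    if n2 > 1.0:
        raise ValueError("vector part must have norm <= 1")
    w, x, y, z = math.sqrt(1.0 - n2), e1, e2, e3
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def transform(rot, trans, p):
    """Apply rotation then translation to a 3D point."""
    return tuple(sum(rot[r][c] * p[c] for c in range(3)) + trans[r]
                 for r in range(3))


def project(p_cam, f, cx, cy):
    """Pinhole projection; returns None for points behind the camera."""
    x, y, z = p_cam
    if z <= 0:
        return None
    return (f * x / z + cx, f * y / z + cy)


def project_stereo(model_pts, pose, f, cx, cy, baseline):
    """Project model points (mm) into the left and right images (pixels).

    pose = (x, y, z, e1, e2, e3): position in mm and quaternion vector part.
    The left camera sits at x = -baseline/2 and the right at +baseline/2.
    """
    x, y, z, e1, e2, e3 = pose
    rot = quat_to_rot(e1, e2, e3)
    left, right = [], []
    for p in model_pts:
        px, py, pz = transform(rot, (x, y, z), p)
        left.append(project((px + baseline / 2, py, pz), f, cx, cy))
        right.append(project((px - baseline / 2, py, pz), f, cx, cy))
    return left, right


left, right = project_stereo([(0.0, 0.0, 0.0)], (0, 0, 500, 0, 0, 0),
                             f=1000.0, cx=960.0, cy=540.0, baseline=120.0)
print(left, right)
```

In this setup the horizontal offset between the two projections of a point equals `f * baseline / z`, so the pair of projected models changes consistently with the assumed depth.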

Photo-Model Matching and Spatial Fitness Function
The HSV color representation is used for the extraction of the target color (Figure 2d). The advantage of HSV is that each of its attributes corresponds directly to a basic color concept, which makes it conceptually simple. In addition, the hue of the HSV color system is robust against changes in lighting intensity.
The fitness function is defined as an evaluation of how well the projection model matches the real target in images captured by the binocular camera, i.e., the similarity measurement.
The symbols related to the function computation are explained as follows; the parameter values are given in Table 1. $\bar{H}_{in}$ is the average hue of the sampling points in the rectangle BECG in Figure 6; it is used as the evaluation threshold for the addition or subtraction of the outer-portion evaluation values. Equations (12) and (13) are the designed fitness between the target captured by the stereo cameras and the projected j-th model on the left and right images, respectively [16].
In a single image, left or right, the theoretical maximum fitness of the projected j-th model is given by Equation (14). Equations (15) and (16) are used to calculate $p^{ij}_{L,in}$ and $p^{ij}_{L,out}$, respectively, which are included in Equation (12) as proposed previously [16].

Figure 7 shows a generated photo-model placed in the 3D searching space and the left and right 2D searching models that are projected from the photo-model, with the pose and size assumed to be $\Phi_M^j$. Figure 8 illustrates the calculation process of the evaluation value $p^{ij}_{L,in}$ for an inner sampling point, including the color judgment process for $C^{ij}_{IL}$ and $C^{ij}_{ML}$ of one inner point. This is a continuous judging process [37].
[Figure: photo-model coordinate system $\Sigma_P$ with axes $x_P$ and $y_P$ (unit: pixel) and the model frame; panel (c): spatial fitness of the j-th model.]

We divide the colors into four categories for similarity judgment: black, white, gray, and other. For grayscale colors, it is necessary to judge from S and V whether the sampling point color $C^{ij}_{ML}$ is close to the point color $C^{ij}_{IL}$ in the captured image. For other colors, we only compare their H values.
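The color judgment described here can be sketched as a small function. The category thresholds, the S/V tolerance, and the hue tolerance below are hypothetical values chosen for illustration, and hue is assumed to be expressed in degrees; none of these are taken from the paper. The category of the model color decides which attributes are compared.

```python
import colorsys


def rgb_to_hsv_deg(r, g, b):
    """Convert 8-bit RGB to (H in degrees, S in [0, 1], V in [0, 1])."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return (h * 360.0, s, v)


def category(h, s, v, v_black=0.2, s_gray=0.15, v_white=0.85):
    """Classify a color as black, white, gray, or other (thresholds assumed)."""
    if v < v_black:
        return "black"
    if s < s_gray:
        return "white" if v > v_white else "gray"
    return "other"


def hue_distance(h1, h2):
    """Shortest angular distance between two hues, in degrees."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)


def similar(c_model, c_image, hue_tol=20.0, sv_tol=0.2):
    """Judge whether an image color matches a model color.

    Chromatic model colors are compared by hue only; black, white, and
    gray model colors are compared by saturation and value.
    """
    if category(*c_model) == "other":
        return hue_distance(c_model[0], c_image[0]) <= hue_tol
    return (abs(c_model[1] - c_image[1]) <= sv_tol
            and abs(c_model[2] - c_image[2]) <= sv_tol)


print(similar((60.0, 0.8, 0.9), (70.0, 0.5, 0.6)))    # hue-only comparison
print(similar((0.0, 0.05, 0.5), (200.0, 0.10, 0.55)))  # S/V comparison
```

Comparing only hue for chromatic colors keeps the judgment tolerant of brightness changes, while grayscale colors have no meaningful hue and therefore need S and V.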
The evaluation value of each individual sampling point (the i-th point) is determined from color similarity in constant time, O(1). Therefore, the complexity of the procedure in Figure 8 can be regarded as O(1). For each photo-model (the j-th photo-model), the fitness calculation in Equation (12) accumulates these evaluation values over the sampling points, so its complexity is linear in the number of sampling points.

Figure 6 shows the average hue $\bar{H}_{in}$ of the sampling points in the inner rectangle BECG, which is used as the evaluation threshold for the outer-portion sampling points, $p^{ij}_{L,out}$ or $p^{ij}_{R,out}$. Figure 5b shows the j-th model projected from 3D to 2D on the left image. The coordinates of the sampling points are indicated as $\cdots, {}^{IL}\mathbf{r}_{i-1}^j, {}^{IL}\mathbf{r}_i^j, {}^{IL}\mathbf{r}_{i+1}^j, \cdots$. In Equation (15) and Figure 8, if the color $C^{ij}_{IL}$ of a point in the captured image that lies inside the surface model frame $S_{L,in}$ is similar to the color $C^{ij}_{ML}$ of the corresponding point in the model, the fitness value increases by the voting value $e_1$. These sampling points are the dots designated by (A) in Figure 5b. The fitness value decreases by $e_2$ for every inner-portion point where $C^{ij}_{ML}$ differs from $C^{ij}_{IL}$ in the left camera image. This means that the model does not precisely overlap the target in the input image, as represented by (B) in Figure 5b.
Similarly, in Equation (16), if the hue $H^{ij}_{IL}$ of a point in $S_{L,out}$ in the left camera image differs from the average hue $\bar{H}_{in}$ of the target by more than a tolerance of 20, the fitness value increases by $e_3$. This means that the $S_{L,out}$ strip surrounding $S_{L,in}$ overlaps with the background, i.e., the model and the target overlap correctly, as shown by (C) in Figure 5b. Otherwise, the fitness value decreases by $e_4$. This corresponds to points on $S_{L,out}$ that overlap with the real target, as shown by (D) in Figure 5b.
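The voting scheme for one image can be summarized in a short sketch. It is only an illustration of the add-and-subtract logic described above, not a reproduction of Equations (12), (15), and (16): the voting values passed as `e` are hypothetical, hue is assumed to be in degrees, and any normalization used in the original equations is omitted.

```python
def hue_distance(h1, h2):
    """Shortest angular distance between two hues, in degrees."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)


def single_image_fitness(inner_similar, outer_hues, mean_inner_hue,
                         e=(1.0, 1.0, 1.0, 1.0), hue_tol=20.0):
    """Accumulate votes for one projected model in one image.

    inner_similar:  list of bools, True if an inner point's image color
                    was judged similar to the model color
    outer_hues:     image hues sampled at the outer-portion points
    mean_inner_hue: average hue of the inner rectangle
    e:              (e1, e2, e3, e4) voting values (hypothetical here)
    """
    e1, e2, e3, e4 = e
    score = 0.0
    for ok in inner_similar:
        # Cases (A) and (B): inner point matches or misses the target.
        score += e1 if ok else -e2
    for h in outer_hues:
        if hue_distance(h, mean_inner_hue) > hue_tol:
            score += e3   # case (C): outer strip lies on the background
        else:
            score -= e4   # case (D): outer strip lies on the target
    return score


score = single_image_fitness([True, True, False], [200.0, 65.0], 60.0,
                             e=(1.0, 0.5, 0.4, 0.2))
print(score)
```

The outer-portion votes penalize models that are too small for the target, since their surrounding strip then falls on target-colored pixels; this is what lets the fitness respond to size as well as pose.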
Likewise, the functions $p^{ij}_{R,in}$ and $p^{ij}_{R,out}$ are calculated in the right camera image. As shown in Figure 7, to reduce the adverse effect on pose detection of a high model-matching value in only one of the two images, a similarity factor is proposed in this study. This factor, denoted as $g_j$, is defined in Equation (17), where $\mu = 1$ and $\sigma = 0.08$. The value of $g_j$ is limited to the range [0, 1]. Higher values of $g_j$ indicate closer values of $F_L^j$ and $F_R^j$. Finally, the stereo-matching fitness of the j-th model is calculated as in Equation (18).

Figure 1a shows the experimental environment. The stereo camera is a ZED 2i. The resolution of the stereo images is 1920 × 1080 pixels. The PC is a Lenovo Legion Y7000 2021 (CPU: i5-11400H, 2.70 GHz; RAM: 16 GB).

Pose-Size Estimation Experiment with the Genetic Algorithm
A pose and size detection experiment was conducted in a real application scenario. Figure 9a shows the images observed by the stereo camera. Using the same left and right photos (Figure 9a), two separate experiments were conducted, each with only one target, a pear and a sunscreen.
The fitness function $F^j(\Phi_M^j)$ transforms the detection problem into an optimization problem over the pose and ratio $\Phi_M^j$ [16]. We chose the GA as the optimization method to find the maximum fitness value because of its simplicity and effectiveness [16,38]. Under the GA, 3D models with random poses and ratios generated from the prepared photos converge to the target objects in 3D space. The GA stops evolving after the 1000th generation.
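To illustrate how a GA can maximize such a fitness function, the following sketch runs an elitist GA on a toy fitness. It is not the authors' implementation: the toy fitness stands in for the photo-model matching fitness, and the uniform crossover, Gaussian mutation, population size, and elite count are all assumptions. Elite individuals keep their stored fitness between generations, which is valid only because the fitness does not change over time (as with a static image pair).

```python
import random


def run_ga(fitness, bounds, pop_size=30, n_elite=10, generations=200,
           mut_sigma=0.05, seed=0):
    """Maximize `fitness` over a box; returns (best_fitness, best_individual)."""
    rng = random.Random(seed)

    def clip(v, lo, hi):
        return max(lo, min(hi, v))

    # First generation: random individuals inside the search bounds.
    pop = []
    for _ in range(pop_size):
        ind = [rng.uniform(lo, hi) for lo, hi in bounds]
        pop.append((fitness(ind), ind))

    for _ in range(generations):
        # Sort by fitness and keep the best individuals (elitism).
        pop.sort(key=lambda fi: fi[0], reverse=True)
        elite = pop[:n_elite]
        children = []
        while len(children) < pop_size - n_elite:
            (_, a), (_, b) = rng.sample(elite, 2)
            # Uniform crossover followed by Gaussian mutation.
            child = [rng.choice(pair) for pair in zip(a, b)]
            child = [clip(g + rng.gauss(0.0, mut_sigma * (hi - lo)), lo, hi)
                     for g, (lo, hi) in zip(child, bounds)]
            # Only new individuals are evaluated; elites keep their score.
            children.append((fitness(child), child))
        pop = elite + children

    return max(pop, key=lambda fi: fi[0])


def toy_fitness(ind):
    """Toy stand-in for the matching fitness; peak at (0.3, -0.2)."""
    return -((ind[0] - 0.3) ** 2 + (ind[1] + 0.2) ** 2)


best_f, best = run_ga(toy_fitness, [(-1.0, 1.0), (-1.0, 1.0)])
print(best_f, best)
```

In the setting of this paper, each individual would instead encode the pose and ratio parameters, and the fitness call would project the corresponding photo-model into both images and score the match.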
With reference to Equation (19), the GA proceeds as follows:
(1) First, the individuals are randomly generated in the 3D searching area as the first generation;
(2) New images captured by the dual-eye cameras are input;
(3) The fitness value of every individual is calculated;
(4) The individuals are sorted by their calculated fitness values;
(5) The best individuals are selected from the current population, and the weak individuals are removed;
(6) The individuals for the next generation are reproduced by performing crossover and mutation between the selected individuals;
(7) Only the new individuals in the next generation are evaluated by the fitness function, as shown in the "Evaluation (2)" block, because the right and left images do not change and the top individuals therefore do not need their fitness to be calculated again;
(8) The above process is repeated until the desired generation is reached.
Finally, the GA outputs the best individuals of the 100th, 500th, and 1000th generations, and then terminates the evolutionary process.

The "Measure" row corresponds to the actual sizes and positions of the targets, which were measured manually with a tape measure. By the 1000th generation, the experimental results closely matched the actual values. The detected object's pose, length, and width closely resembled their actual counterparts. It is worth noting that the unitless orientation in quaternion represents the pose, and the actual orientation of the targets remains unknown.

Table 2. Pear C01 GA's detection results. Through perspective transformation, the projection results of the model on the left and right images are shown in Figure 9(b1-d1). The "Measure" row shows the target's measurements taken with the tape measure.

Table 3. Sunscreen C02 GA's detection results. Through perspective transformation, the projection results of the model on the left and right images are shown in Figure 9(b2-d2).
In Table 2, the last row shows the relative errors of the distance and size. From the table, we can observe that the distance error $e_{zC01}$ is less than 2 cm. In Table 3, the distance error $e_{zC02}$ is also less than 2 cm.
The measurement results for the sunscreen (Table 3) outperform those for the pear (Table 2). Although the datasets are different [32,39], there are still comparable aspects in terms of object size and pose detection. The pear's results show slightly lower accuracy than the measurements reported in [32]. On the other hand, the sunscreen's results are better than the corresponding distance measurements presented in [32], although that study did not perform object pose detection. Notably, the pose errors for both objects are similar to the results reported in [39].
For both the sunscreen and the pear, the distance z detection error is less than 2 cm. Table 3 shows that the GA has already found an optimal solution in the 500th generation, which is the same as in the 1000th generation. This indicates that the algorithm has successfully converged to the best possible solution. Regarding the pear in Table 2, the orientation ε 3 at the 1000th generation is −0.77, which is less than −0.5 and indicates a reverse rotation around the Z M axis of more than 90 degrees. However, the actual pose of the pear is lying horizontally and only rotated by less than 90 degrees. The pose detection result is close to the actual pose.
The comparison with other methods is shown in Table 4. Orientation errors are transformed from quaternion to Euler angles (e 1 , e 2 , e 3 ) for comparison. Qualitative analysis was performed as above on the pear orientation detection. In general model-based methods, it is assumed that the model has the same size as the object, resulting in no size errors ∆L and ∆B [30,39]. For comparison, we examined findings related to the PM ratio [20] or stereo vision [32] for size measurements, although these studies did not perform pose measurements. Our method can be regarded as comparable to other reliable methods in terms of size and pose measurements. On average, it falls into the upper middle level of accuracy. Furthermore, our method is capable of reliably estimating both size and pose.
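The conversion of a quaternion orientation to Euler angles for such a comparison can be sketched as below. The convention used in the paper is not stated in this section, so the sketch makes explicit assumptions: the three orientation variables are the vector part of a unit quaternion with non-negative scalar part $\eta = \sqrt{1 - \|\varepsilon\|^2}$, and the Euler angles follow the common roll-pitch-yaw (ZYX) convention. The function names are hypothetical.

```python
import math


def quat_vector_to_euler_deg(e1, e2, e3):
    """Convert a quaternion vector part to (roll, pitch, yaw) in degrees."""
    n2 = e1 * e1 + e2 * e2 + e3 * e3
    if n2 > 1.0:
        raise ValueError("vector part must have norm <= 1")
    w, x, y, z = math.sqrt(1.0 - n2), e1, e2, e3
    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Clamp to guard against rounding slightly outside [-1, 1].
    s = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))
    pitch = math.asin(s)
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return tuple(math.degrees(a) for a in (roll, pitch, yaw))


def angle_error_deg(est, ref):
    """Per-axis angle difference wrapped into [-180, 180) degrees."""
    return tuple(((a - b + 180.0) % 360.0) - 180.0 for a, b in zip(est, ref))


est = quat_vector_to_euler_deg(0.0, 0.0, math.sin(math.pi / 4))
print(est)
print(angle_error_deg(est, (0.0, 0.0, 80.0)))
```

Wrapping the difference avoids reporting a spuriously large error when the estimated and reference angles lie on opposite sides of the ±180 degree boundary.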
Through the experimental results, it is confirmed that: (1) The proposed variable photo-model-based recognition method utilizes stereo vision and a 2D photo to estimate the pose of a 3D target object, extending the traditional approach; (2) This method can generate 3D plane models with varying aspect ratios and sizes using just one photo, enabling accurate object pose estimation; (3) The variable photo-model method combines deep learning techniques, utilizing pretrained weights from existing datasets, and can be executed on a CPU with limited hardware resources.

Table 4. Position (mm), orientation (degrees), and size relative errors. In general model-based methods, it is assumed that the model is the same size as the object with no dimensional errors. Results of studies using PM ratio [20] or stereo vision [32] for size measurements are also included in the table.

Conclusions and Future Work
This study presented a pose and size estimation method using the variable photo-model. The experimental results using two different objects demonstrated that the generated variable PM ratio photo-model was able to detect the objects' pose and size in a complex home environment. The accuracy was found to be better for the sunscreen than for the pear. The adaptability of the variable photo-model method to different target shapes was also observed when using a photo from the same category.
The fact that the detection performance is better for industrially manufactured products (sunscreen) with fixed shapes compared to an agricultural product (pear) with irregular shape variations suggests that the method's ability to handle shape variations is not sufficiently refined and requires improvement.
In terms of future research, it is recommended to include a wider variety of experimental objects to enhance the generalizability of the findings. Moreover, conducting information extraction from existing datasets for comparative studies would provide valuable insights. Furthermore, the impact of different deep learning models on the generation of photo-models should be thoroughly investigated and analyzed.

Data Availability Statement: Data sharing is not applicable to this article, as this study has presented all data.

Conflicts of Interest:
The authors declare no conflicts of interest.